PLOS Digital Health
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
Background: Clinicians in care management programs are often in short supply relative to patient demand, especially in US Medicaid programs, and must simultaneously address clinical risk, time efficiency, and patients' social needs. Many studies have shown that large language models may assist with tasks such as summarizing patient care and generating care plans; yet these studies also show that different objectives given to agents often conflict and produce problems for safety, efficiency an...
Retrieval-augmented generation (RAG) holds promise for supporting high-stakes medical decision-making. However, most research has focused on downstream optimization of parameters and algorithms. This Phase 1 foundational study quantitatively evaluated the upstream quality of knowledge documents and their impact on retrieval performance, using Japanese clinical research protocol manuals for Institutional Review Board pre-screening support as a case study. We established a three-tier evaluation fr...
Importance: Emerging evidence suggests healthcare AI systems may exhibit deceptive alignment (appearing safe during validation while optimizing for misaligned objectives in deployment) and evaluation awareness (detecting and adapting behavior during audits), undermining regulatory validation frameworks. Objective: To quantify the performance of multi-layer red-teaming approaches in detecting sophisticated healthcare AI safety failures across 10 vulnerability domains. Design, Setting, and Participa...
Background: Large Language Models (LLMs) show promise for clinical decision support in Intensive Care Units (ICUs), but their safety and reliability remain inadequately evaluated through dual testing of both memory-dependent and memory-independent safety mechanisms. Objective: To comprehensively evaluate LLMs using two independent safety tests: context-dependent contraindication memory (penicillin allergy recall) and context-independent authority resistance (Extended Milgram Test), revealing whether...
Digital health technologies are powerful, enhancing data collection, participant engagement, and personalized health interventions, yet their rapid proliferation has outpaced guidance for research participant protection. Current practice assists researchers in identifying risks but provides limited support for comprehensive risk management. To address this gap, we developed the Digital Health Checklist-Risk Management (DHC-RM) Tool, which integrates the established Digital Health Checklist with ap...
Background: Implementation challenges are a major contributor to the failure of novel medical technologies in low-resource settings. Although frameworks such as the Consolidated Framework for Implementation Research (CFIR) are widely used to evaluate interventions post-implementation, their prospective application during the product development phase remains limited. Methods: This study aimed to prospectively assess implementation factors relevant to the future adoption of a prototype handheld ult...
Clinical prediction models are often created using large routinely collected datasets. It is essential that prediction models are developed with appropriate data and methods and transparently reported to ensure that decisions are based on reliable predictions. Kaggle is a popular competition website where users learn and apply analysis skills on a range of datasets. We identified two large, publicly available Kaggle datasets, on stroke and diabetes, that lack clear data provenance, but are widel...
Deploying large language models (LLMs) in clinical settings is limited by security, reliability, latency, and accessibility concerns that favor smaller, on-device or on-premise models. However, these smaller models may struggle to meet accuracy requirements. While fine-tuning and retrieval-augmented generation (RAG) can improve domain-specific accuracy, these methods require additional labeled data, technical skill, and infrastructure. In contrast, test-time scaling -- allocating extra token-budg...
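The test-time scaling idea in the abstract above (spending extra inference compute rather than fine-tuning or building RAG infrastructure) can be illustrated with self-consistency voting: sample several completions and keep the most frequent answer. This is a generic sketch, not the study's method; the sampled answers and the `majority_vote` helper are hypothetical.

```python
import collections

def majority_vote(sampled_answers):
    """Aggregate several temperature-sampled completions into one answer
    (self-consistency): the extra inference-time tokens buy reliability
    without labeled data, fine-tuning, or retrieval infrastructure."""
    votes = collections.Counter(sampled_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(sampled_answers)

# Pretend these came from 7 sampled runs of a small on-device clinical QA
# model (hypothetical outputs, not data from the study).
samples = ["A", "A", "B", "A", "C", "A", "A"]
answer, agreement = majority_vote(samples)  # -> ("A", 5/7)
```

The trade-off is direct: each extra sample multiplies the token budget, so the voting width can be tuned to whatever latency the deployment tolerates.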
Objective: The use of ambient AI documentation tools is rapidly growing in US hospitals and clinics. Such tools generate the first draft of clinical notes from scribed patient-provider conversations, which clinicians can then review and edit before signing into electronic health records (EHRs). Understanding how and why clinicians make modifications to AI-generated drafts is critical to improving AI design and clinical efficiency, yet it has been under-studied. Th...
Health information seeking has fundamentally changed since the advent of Large Language Models (LLMs), with nearly one-third of ChatGPT's 800 million users asking health questions weekly. Understanding the sources behind those AI-generated responses is vital, as health organizations and providers are also investing in digital strategies to organically improve their ranking, reach, and visibility in LLM systems like ChatGPT. As AI search optimization strategies gain maturity, this study introduces...
Continuous ECG monitoring has become an integral part of modern hospital-based care. However, missing data presents significant challenges in deploying real-time ECG-based predictive systems. Research on the implementation of imputation techniques for time-series ECG is limited. Furthermore, the performance of imputation techniques is typically benchmarked using random masking, which may not reflect the real-world missingness patterns encountered in clinical practice. This stud...
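The benchmarking concern raised above can be made concrete with a toy comparison: score the same simple imputer (linear interpolation) under point-wise random masking and under a contiguous gap of the kind a lead disconnection produces. This is an illustrative sketch, not the study's protocol; the waveform, masking fraction, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
signal = np.sin(np.linspace(0, 8 * np.pi, 500))  # toy ECG-like waveform

def random_mask(n, frac, rng):
    """Point-wise random masking: the common but often unrealistic benchmark."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(frac * n), replace=False)] = True
    return mask

def block_mask(n, frac, rng):
    """One contiguous dropout, closer to a real lead-off or disconnection gap."""
    mask = np.zeros(n, dtype=bool)
    length = int(frac * n)
    start = rng.integers(0, n - length)
    mask[start:start + length] = True
    return mask

def interp_rmse(signal, mask):
    """RMSE of linear interpolation, scored only on the masked samples."""
    idx = np.arange(len(signal))
    imputed = np.interp(idx[mask], idx[~mask], signal[~mask])
    return float(np.sqrt(np.mean((imputed - signal[mask]) ** 2)))

rmse_random = interp_rmse(signal, random_mask(len(signal), 0.2, rng))
rmse_block = interp_rmse(signal, block_mask(len(signal), 0.2, rng))
# At the same 20% missingness, the contiguous gap is far harder to impute,
# so random-masking benchmarks can overstate real-world performance.
```

The same masking pair can wrap any imputer under test, which is the point: the masking pattern, not just the missingness rate, drives the benchmark score.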
Introduction: Large Language Models (LLMs) in healthcare practice and education have been evaluated using medical question-answering (QA) datasets, with excellent performance. However, multiple-choice questions fall short when assessing more complex language interactions. Objective: To evaluate the time invested and the validity of medical students' responses to clinical questions using ArkangelAI, compared with traditional search methods. Methods: Randomized, double-blind trial with clinical medical stude...
Background: Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents' scholarly projects and compared its performance with that of expert human evaluators. Methods: We evaluated whether the AI-generated feedback achieves quality comparable to expert feedback. The tool ing...
Introduction: Healthcare organizations have begun incorporating screening procedures for social determinants of health (SDOH) into care, recognizing the impact these factors can have on health outcomes. We aimed to present methods for evaluating redundancy in the risk information gained across SDOH questions, and for evaluating demographic biases in whether patients were asked SDOH questions and whether they declined to answer them. Methods: SDOH question data were analyzed for 1...
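One simple way to operationalize the redundancy evaluation described above is pairwise mutual information between question responses: a near-duplicate question shares essentially all of its risk information with its twin, while an independent question contributes new information. This is an illustrative sketch with hypothetical responses, not the study's actual method or data.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two screening questions'
    yes/no responses. High MI relative to a question's own entropy means
    the second question adds little new risk information."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log2(pj / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical responses to three SDOH questions (1 = risk flagged);
# q_food and q_food2 are duplicates, q_housing is largely independent.
q_food    = [1, 1, 0, 0, 1, 0, 1, 0]
q_food2   = [1, 1, 0, 0, 1, 0, 1, 0]   # redundant near-duplicate
q_housing = [0, 1, 0, 1, 1, 0, 0, 0]

mi_dup = mutual_information(q_food, q_food2)   # equals H(q_food) = 1 bit here
mi_ind = mutual_information(q_food, q_housing)
```

Ranking question pairs by this score flags candidates for consolidation, shortening screeners without discarding risk signal.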
Background: CT scans are the gold-standard diagnostic test for pulmonary embolism (PE). Despite stable PE prevalence, CT use is rising in emergency departments (EDs), suggesting test overuse. Current methods for measuring test yield are error-prone or not scalable, so we tested the accuracy of an open-source foundational large language model (LLM) for identifying PEs from free-text radiology reports. Methods: Our retrospective diagnostic accuracy study used 10,173 CT-PE reports from 216 radiolo...
Background: Artificial intelligence is increasingly embedded in healthcare delivery. Its legitimacy depends on institutional governance, not technical performance alone. Prior research has centered on clinicians and patients; less attention has been given to the cybersecurity professionals who sustain the digital infrastructures that support health AI. This study examines how cybersecurity professionals conceptualize AI as clinical infrastructure and how these interpretations shape understandings of t...
Artificial intelligence models in healthcare often fail to improve patient outcomes despite strong predictive performance because they are frequently developed with limited understanding of clinical workflows and system implementation. We demonstrate a human-centered design approach to define prediction targets before model development, ensuring alignment with actionable clinical interventions. Using pediatric acute kidney injury as a case study, we convened a multidisciplinary working group and...
Objective: Ambient artificial intelligence (AI) tools are increasingly adopted in clinical practice. This study investigated whether and how clinicians edit AI-generated drafts, and the linguistic differences between AI drafts and clinician-finalized notes. Materials and Methods: This retrospective study analyzed real-world data from ambulatory clinics at a large academic health system spanning two vendor deployments. We quantified clinicians' editing behavior usin...
Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, p...
Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Because the opacity of AI algorithms challenges human interaction, explainable AI (XAI) addresses this by providing insight into AI decision-making, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnostic AI model and different XAI exp...